I recently had the opportunity to teach some tidyverse code to colleagues and classmates in a series of workshops. Some had already dabbled in R or other languages, but it was the first time that the majority of participants had written a single line of code.
In preparation for this workshop series, I found a lot of inspiration in Michael Levy’s presentation on teaching R, which itself echoes principles preached by other R advocates.
One of my biggest takeaways is that live coding works.
Writing code in real time shows every single step we make from opening the IDE, to reshaping the data, to debugging inevitable errors, to rendering a final report.
Within a few short weeks of learning to code, it is surprising how many tiny steps become automatic and taken for granted. Tack on a couple more months and newcomers will think you’re speaking in an entirely different language when you’re explaining something requiring context they simply haven’t yet encountered. Add a few years and… yeesh.
Something that frustrated me when I first started is that code explanations often seem to be written in such a way that dismisses how difficult the basics can be. I’m half-convinced that for some folks, the trauma was so great that they have simply blocked it from memory.
Live coding enforces a maximum speed in moving through exercises, which gives students more time to digest what you’re doing. It also gives them more opportunity to ask questions about details you might find trivial, but only because you already suffered through them.
I also think that the benefits of live coding go both ways. I found myself answering questions that framed things in ways that I had not even considered. Additionally, I have a better sense now of which concepts need to be covered in more detail, as they weren’t necessarily as intuitive for others as they were for me. On the flip-side, concepts that I remember struggling with may not be difficult at all for others to understand.
I’ll cut the bloggyness of this blog post here, and get to the meat of what we covered. Before I forget, the title image came originally from R Memes for Statistical Fiends. If you’re reading this, you’ll likely find their memes of satisfactory dankness.
# install.packages("devtools")
# install.packages("tidyverse")
library(tidyverse)
# loads {ggplot2}, {tibble}, {tidyr}, {readr}, {purrr}, {dplyr}, {stringr}, and {forcats}
# install.packages("gapminder")
library(gapminder)
# loads the gapminder data set
# install.packages("kableExtra")
library(knitr)
library(kableExtra)
# just to prettify printed tables when knitting
# library(scales)
In the following exercises, gm.data.frame will be used to demonstrate actions that use {base} R methods for data.frame operations while gm_df will be used to demonstrate {tidyverse} methods for tibble operations.
gm.data.frame <- as.data.frame(gapminder)
gm_df <- gapminder
tibbles
class(gm.data.frame)
## [1] "data.frame"
class(gm_df)
## [1] "tbl_df" "tbl" "data.frame"
tibbles are opinionated data.frames: they keep everything that is helpful about data.frames, change what is unhelpful, and add methods that make them even more useful.
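One concrete difference, sketched here on a made-up two-column data set (the names `df`, `tb`, `df_col`, and `tb_col` are only for illustration): pulling out a single column with `[` silently turns a data.frame into a bare vector, while a tibble stays a tibble.

```r
library(tibble)

# Toy data, made up for illustration only
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Single-column subsetting with [
df_col <- df[, "x"]  # the data.frame drops down to a plain vector
tb_col <- tb[, "x"]  # the tibble remains a one-column tibble

is.data.frame(df_col)  # FALSE
is_tibble(tb_col)      # TRUE
```

That consistency is a big part of why tibbles are easier to reason about once operations are chained together.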
Printing gm.data.frame dumps the whole data set to the console, typically requiring head() to limit the output.
head(gm.data.frame)
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
Printing gm_df shows the dimensions and the data type of each column, and only prints the first 10 rows.
gm_df
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779
## 2 Afghanistan Asia 1957 30.3 9240934 821
## 3 Afghanistan Asia 1962 32.0 10267083 853
## 4 Afghanistan Asia 1967 34.0 11537966 836
## 5 Afghanistan Asia 1972 36.1 13079460 740
## 6 Afghanistan Asia 1977 38.4 14880372 786
## 7 Afghanistan Asia 1982 39.9 12881816 978
## 8 Afghanistan Asia 1987 40.8 13867957 852
## 9 Afghanistan Asia 1992 41.7 16317921 649
## 10 Afghanistan Asia 1997 41.8 22227415 635
## # ... with 1,694 more rows
%>%
The pipe (%>%) is used to chain operations together. Underneath the hood, it’s taking the value on the left-hand side of %>% and using it as the first argument of the function on the right-hand side of %>%.
For example, these 2 lines are doing the exact same thing.
head(gm_df)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779
## 2 Afghanistan Asia 1957 30.3 9240934 821
## 3 Afghanistan Asia 1962 32.0 10267083 853
## 4 Afghanistan Asia 1967 34.0 11537966 836
## 5 Afghanistan Asia 1972 36.1 13079460 740
## 6 Afghanistan Asia 1977 38.4 14880372 786
gm_df %>% head()
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779
## 2 Afghanistan Asia 1957 30.3 9240934 821
## 3 Afghanistan Asia 1962 32.0 10267083 853
## 4 Afghanistan Asia 1967 34.0 11537966 836
## 5 Afghanistan Asia 1972 36.1 13079460 740
## 6 Afghanistan Asia 1977 38.4 14880372 786
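The same rule holds when the function on the right-hand side takes extra arguments. Here is a minimal sketch on a made-up vector (`x` and the result names are hypothetical, not part of the workshop data):

```r
library(magrittr)  # provides %>%; library(tidyverse) also makes it available

x <- c(1, 4, 9)

# x %>% f(y) is evaluated as f(x, y)
piped  <- x %>% head(2)
nested <- head(x, 2)
identical(piped, nested)

# Chained calls read left to right instead of inside out
total_piped  <- x %>% sqrt() %>% sum()
total_nested <- sum(sqrt(x))
```

Both versions compute the same thing; only the reading order changes.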
For simple operations involving 1 function, %>% is only slightly beneficial in that it improves readability as the flow of operations goes from left to right.
%>% becomes truly useful when you need to perform multiple operations in succession.
As an arbitrary example, let’s say that we want to select the head() (first 6 rows) of gm.data.frame and convert it to a tibble.
Without %>%, we can do this in a few ways.
- Take gm.data.frame’s head() and assign it to no_pipe_1
- Convert no_pipe_1 to a tibble with as_tibble() and assign it to no_pipe_2
no_pipe_1 <- head(gm.data.frame)
no_pipe_2 <- as_tibble(no_pipe_1)
no_pipe_2
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## * <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779
## 2 Afghanistan Asia 1957 30.3 9240934 821
## 3 Afghanistan Asia 1962 32.0 10267083 853
## 4 Afghanistan Asia 1967 34.0 11537966 836
## 5 Afghanistan Asia 1972 36.1 13079460 740
## 6 Afghanistan Asia 1977 38.4 14880372 786
- Nest head() inside of as_tibble().
as_tibble(head(gm.data.frame))
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## * <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779
## 2 Afghanistan Asia 1957 30.3 9240934 821
## 3 Afghanistan Asia 1962 32.0 10267083 853
## 4 Afghanistan Asia 1967 34.0 11537966 836
## 5 Afghanistan Asia 1972 36.1 13079460 740
## 6 Afghanistan Asia 1977 38.4 14880372 786
With %>%, we can chain these actions together in the order in which they occur, which is also the way we read English.
- gm_df
- head() (keeping the top 6 rows)
- as_tibble() (converting it to a tibble data frame)
gm_df %>% head() %>% as_tibble()
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779
## 2 Afghanistan Asia 1957 30.3 9240934 821
## 3 Afghanistan Asia 1962 32.0 10267083 853
## 4 Afghanistan Asia 1967 34.0 11537966 836
## 5 Afghanistan Asia 1972 36.1 13079460 740
## 6 Afghanistan Asia 1977 38.4 14880372 786
In practice, it’s usually best to place each of the functions on a separate line as it facilitates debugging and further improves readability.
gm_df %>%
as_tibble() %>%
head()
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779
## 2 Afghanistan Asia 1957 30.3 9240934 821
## 3 Afghanistan Asia 1962 32.0 10267083 853
## 4 Afghanistan Asia 1967 34.0 11537966 836
## 5 Afghanistan Asia 1972 36.1 13079460 740
## 6 Afghanistan Asia 1977 38.4 14880372 786
From here on, you’ll notice prettify(). This is only being used to print tables in a clean format when the document is knit()ted.
data.frames will print a default maximum of 3 rows while tibbles will print a default maximum of 10 rows.
prettify <- function(df, n = NULL, cols_changed = NULL, rows_changed = NULL) {
  # default row limit: 10 for tibbles, 3 for plain data.frames
  # (is_tibble() replaces the deprecated is.tibble())
  if (is.null(n)) n <- if (is_tibble(df)) 10 else 3
  pretty_df <- df %>%
    head(n) %>%
    kable(format = "html") %>%
    kable_styling(bootstrap_options = c("striped", "bordered", "condensed",
                                        "hover", "responsive"),
                  full_width = FALSE)
  # highlight the columns flagged as changed
  if (!is.null(cols_changed)) {
    pretty_df <- pretty_df %>%
      column_spec(cols_changed, bold = TRUE, color = "black", background = "#C8FAE3")
  }
  # highlight the rows flagged as changed
  if (!is.null(rows_changed)) {
    pretty_df <- pretty_df %>%
      row_spec(rows_changed, bold = TRUE, color = "black", background = "#C8FAE3")
  }
  return(pretty_df)
}
gm.data.frame %>%
prettify()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
gm_df %>%
prettify()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
| Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 |
| Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 |
| Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 |
| Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 |
You’ll also see a toy data set for the introductory examples that start each section.
sample_countries <- c("Tunisia", "Nicaragua", "Singapore", "Hungary", "New Zealand",
"Nigeria", "Brazil", "Sri Lanka", "Ireland", "Australia")
sample_df <- gm_df %>%
filter(year == 2007,
country %in% sample_countries)
sample_df %>%
prettify()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.367 |
| Brazil | Americas | 2007 | 72.390 | 190010647 | 9065.801 |
| Hungary | Europe | 2007 | 73.338 | 9956108 | 18008.944 |
| Ireland | Europe | 2007 | 78.885 | 4109086 | 40675.996 |
| New Zealand | Oceania | 2007 | 80.204 | 4115771 | 25185.009 |
| Nicaragua | Americas | 2007 | 72.899 | 5675356 | 2749.321 |
| Nigeria | Africa | 2007 | 46.859 | 135031164 | 2013.977 |
| Singapore | Asia | 2007 | 79.972 | 4553009 | 47143.180 |
| Sri Lanka | Asia | 2007 | 72.396 | 20378239 | 3970.095 |
| Tunisia | Africa | 2007 | 73.923 | 10276158 | 7092.923 |
With tibbles, %>%, and the concept of tidy data covered, let’s take a dive.
{dplyr}
{dplyr} provides a grammar of data manipulation and a set of verb functions that solve most common data carpentry challenges in a consistent fashion.
- glimpse()
- select()
- filter()
- arrange()
- mutate()
- summarize()
- group_by()
glimpse()
In addition to the summary(), dim()ensions, and str()ucture functions that can be used to inspect data, you can now use {dplyr}’s glimpse().
summary(gm.data.frame)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
dim(gm.data.frame)
## [1] 1704 6
str(gm.data.frame)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
glimpse(gm_df)
## Observations: 1,704
## Variables: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
select() columns
sample_df %>%
prettify()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.367 |
| Brazil | Americas | 2007 | 72.390 | 190010647 | 9065.801 |
| Hungary | Europe | 2007 | 73.338 | 9956108 | 18008.944 |
| Ireland | Europe | 2007 | 78.885 | 4109086 | 40675.996 |
| New Zealand | Oceania | 2007 | 80.204 | 4115771 | 25185.009 |
| Nicaragua | Americas | 2007 | 72.899 | 5675356 | 2749.321 |
| Nigeria | Africa | 2007 | 46.859 | 135031164 | 2013.977 |
| Singapore | Asia | 2007 | 79.972 | 4553009 | 47143.180 |
| Sri Lanka | Asia | 2007 | 72.396 | 20378239 | 3970.095 |
| Tunisia | Africa | 2007 | 73.923 | 10276158 | 7092.923 |
sample_df %>%
select(country, pop) %>%
prettify()
| country | pop |
|---|---|
| Australia | 20434176 |
| Brazil | 190010647 |
| Hungary | 9956108 |
| Ireland | 4109086 |
| New Zealand | 4115771 |
| Nicaragua | 5675356 |
| Nigeria | 135031164 |
| Singapore | 4553009 |
| Sri Lanka | 20378239 |
| Tunisia | 10276158 |
The select() family is used to choose columns to keep. You can use bare (unquoted) names.
select() columns by specific names.
- select() gm_df’s country, year, and pop columns
gm_df %>%
select(country, year, pop) %>% # select columns by specific names
prettify()
| country | year | pop |
|---|---|---|
| Afghanistan | 1952 | 8425333 |
| Afghanistan | 1957 | 9240934 |
| Afghanistan | 1962 | 10267083 |
| Afghanistan | 1967 | 11537966 |
| Afghanistan | 1972 | 13079460 |
| Afghanistan | 1977 | 14880372 |
| Afghanistan | 1982 | 12881816 |
| Afghanistan | 1987 | 13867957 |
| Afghanistan | 1992 | 16317921 |
| Afghanistan | 1997 | 22227415 |
select() a range of columns by name
- select() gm_df’s continent column and all columns from lifeExp to gdpPercap
gm_df %>%
select(continent, lifeExp:gdpPercap) %>% # select columns name range
prettify()
| continent | lifeExp | pop | gdpPercap |
|---|---|---|---|
| Asia | 28.801 | 8425333 | 779.4453 |
| Asia | 30.332 | 9240934 | 820.8530 |
| Asia | 31.997 | 10267083 | 853.1007 |
| Asia | 34.020 | 11537966 | 836.1971 |
| Asia | 36.088 | 13079460 | 739.9811 |
| Asia | 38.438 | 14880372 | 786.1134 |
| Asia | 39.854 | 12881816 | 978.0114 |
| Asia | 40.822 | 13867957 | 852.3959 |
| Asia | 41.674 | 16317921 | 649.3414 |
| Asia | 41.763 | 22227415 | 635.3414 |
Deselect a column with -
- select() all of gm_df’s columns except lifeExp
gm_df %>%
select(-lifeExp) %>% # deselect column by name
prettify()
| country | continent | year | pop | gdpPercap |
|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 14880372 | 786.1134 |
| Afghanistan | Asia | 1982 | 12881816 | 978.0114 |
| Afghanistan | Asia | 1987 | 13867957 | 852.3959 |
| Afghanistan | Asia | 1992 | 16317921 | 649.3414 |
| Afghanistan | Asia | 1997 | 22227415 | 635.3414 |
Deselect a range of columns by name
- select() all of gm_df’s columns except those between lifeExp and gdpPercap
gm_df %>%
select(-c(lifeExp:gdpPercap)) %>% # deselect column by name range
prettify()
| country | continent | year |
|---|---|---|
| Afghanistan | Asia | 1952 |
| Afghanistan | Asia | 1957 |
| Afghanistan | Asia | 1962 |
| Afghanistan | Asia | 1967 |
| Afghanistan | Asia | 1972 |
| Afghanistan | Asia | 1977 |
| Afghanistan | Asia | 1982 |
| Afghanistan | Asia | 1987 |
| Afghanistan | Asia | 1992 |
| Afghanistan | Asia | 1997 |
select() a column by index
- select() gm_df’s 4th column
gm_df %>%
select(4) %>% # select column by index
prettify()
| lifeExp |
|---|
| 28.801 |
| 30.332 |
| 31.997 |
| 34.020 |
| 36.088 |
| 38.438 |
| 39.854 |
| 40.822 |
| 41.674 |
| 41.763 |
Deselect a column by index
- select() all of gm_df’s columns except for the 4th column
gm_df %>%
select(-4) %>% # deselect column by index
prettify()
| country | continent | year | pop | gdpPercap |
|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 14880372 | 786.1134 |
| Afghanistan | Asia | 1982 | 12881816 | 978.0114 |
| Afghanistan | Asia | 1987 | 13867957 | 852.3959 |
| Afghanistan | Asia | 1992 | 16317921 | 649.3414 |
| Afghanistan | Asia | 1997 | 22227415 | 635.3414 |
Deselect a range of columns by index
- select() all of gm_df’s columns except the 3rd through 5th columns
gm_df %>%
select(-c(3:5)) %>% # deselect columns by index range
prettify()
| country | continent | gdpPercap |
|---|---|---|
| Afghanistan | Asia | 779.4453 |
| Afghanistan | Asia | 820.8530 |
| Afghanistan | Asia | 853.1007 |
| Afghanistan | Asia | 836.1971 |
| Afghanistan | Asia | 739.9811 |
| Afghanistan | Asia | 786.1134 |
| Afghanistan | Asia | 978.0114 |
| Afghanistan | Asia | 852.3959 |
| Afghanistan | Asia | 649.3414 |
| Afghanistan | Asia | 635.3414 |
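Beyond names, ranges, and indices, select() also accepts helper functions that match columns by pattern. The sketch below uses a made-up tibble called `toy` whose column names mimic gapminder’s; the values are invented purely for illustration.

```r
library(dplyr)

# Toy tibble with made-up values; column names mimic gapminder's
toy <- tibble(
  country   = c("A", "B"),
  lifeExp   = c(70.1, 65.3),
  pop       = c(1000L, 2000L),
  gdpPercap = c(5000, 3000)
)

# columns whose names start with "gdp"
gdp_cols <- toy %>% select(starts_with("gdp")) %>% names()

# columns whose names contain the letter "p"
p_cols <- toy %>% select(contains("p")) %>% names()

# move pop to the front and keep everything else in its original order
reordered <- toy %>% select(pop, everything()) %>% names()
```

starts_with(), contains(), and everything() are handy when a data set has too many columns to type out one by one.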
ggplot() Exercise 1
{ggplot2} is a monster of a package used for data visualization that follows The Grammar of Graphics.
{ggplot2} takes R’s powerful graphics capabilities and makes them more accessible by taking care of many plotting tasks that are often tedious, while allowing for deep, lower-level customization.
ggplot(your data, aes(x = x values, y = y values)) +
geom_boxplot() # the type of plot geometry desired
Steps
- From gm_df, select the lifeExp column
- Pipe (%>%) the result to ggplot()
- Set the aes()thetic values
  - lifeExp for the x values
  - a histogram’s y values are counts of its x values, so we don’t provide them here
- Add geom_histogram() as the geometry of the plot
gm_df %>% # data frame: Data
select(lifeExp) %>% # columns to keep: Data
ggplot(aes(x = lifeExp)) + # x values: Aesthetics
geom_histogram() # histogram: Geometries
Figure 1
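Called without arguments, geom_histogram() falls back to 30 bins and prints a message suggesting you pick your own value. A minimal sketch of setting the bin width explicitly, using a made-up `toy` data frame rather than gm_df (the width of 5 years is an arbitrary choice for illustration):

```r
library(ggplot2)

# Made-up life expectancy values, for illustration only
toy <- data.frame(lifeExp = c(45, 52, 58, 61, 67, 70, 72, 75, 78, 81))

# binwidth is in the units of the x variable, so each bar spans 5 years
p <- ggplot(toy, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5)
```

Printing `p` renders the plot. Trying a few different binwidth values is usually the quickest way to see how much the shape of a histogram depends on that choice.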
filter() Rows
sample_df %>%
select(country, lifeExp) %>%
prettify()
| country | lifeExp |
|---|---|
| Australia | 81.235 |
| Brazil | 72.390 |
| Hungary | 73.338 |
| Ireland | 78.885 |
| New Zealand | 80.204 |
| Nicaragua | 72.899 |
| Nigeria | 46.859 |
| Singapore | 79.972 |
| Sri Lanka | 72.396 |
| Tunisia | 73.923 |
sample_df %>%
select(country, lifeExp) %>%
filter(lifeExp > 75) %>%
prettify(cols_changed = 2)
| country | lifeExp |
|---|---|
| Australia | 81.235 |
| Ireland | 78.885 |
| New Zealand | 80.204 |
| Singapore | 79.972 |
Use filter() to select rows using logic. Rows where a logical expression returns TRUE are kept and others are dropped.
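To make the "logic" part concrete, here is a small sketch on a made-up tibble (`toy` and `kept` are hypothetical names). The expression is evaluated once per row, and only rows where it comes back TRUE survive, which means rows where it is NA are dropped as well.

```r
library(dplyr)

# Made-up values, including one missing life expectancy
toy <- tibble(
  country = c("A", "B", "C", "D"),
  lifeExp = c(81.2, 72.4, NA, 79.9)
)

# The logical expression on its own: TRUE FALSE NA TRUE
is_over_75 <- toy$lifeExp > 75

# filter() keeps rows A and D; B is FALSE and C is NA
kept <- toy %>% filter(lifeExp > 75)
kept$country
```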
filter() rows where numeric() values are greater or less than another value
- filter() gm_df to only keep rows where gdpPercap < 500
gm_df %>%
filter(gdpPercap < 500) %>%
prettify(cols_changed = 6)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Burundi | Africa | 1952 | 39.031 | 2445618 | 339.2965 |
| Burundi | Africa | 1957 | 40.533 | 2667518 | 379.5646 |
| Burundi | Africa | 1962 | 42.045 | 2961915 | 355.2032 |
| Burundi | Africa | 1967 | 43.548 | 3330989 | 412.9775 |
| Burundi | Africa | 1972 | 44.057 | 3529983 | 464.0995 |
| Burundi | Africa | 1997 | 45.326 | 6121610 | 463.1151 |
| Burundi | Africa | 2002 | 47.360 | 7021078 | 446.4035 |
| Burundi | Africa | 2007 | 49.580 | 8390505 | 430.0707 |
| Cambodia | Asia | 1952 | 39.417 | 4693836 | 368.4693 |
| Cambodia | Asia | 1957 | 41.366 | 5322536 | 434.0383 |
filter() rows using multiple logical expressions where all must be TRUE
- filter() gm_df to only keep rows where year > 1990 and lifeExp < 40
  - , and & are evaluated identically in filter()
gm_df %>%
filter(year > 1990, lifeExp < 40) %>%
prettify(cols_changed = 3:4)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Rwanda | Africa | 1992 | 23.599 | 7290203 | 737.0686 |
| Rwanda | Africa | 1997 | 36.087 | 7212583 | 589.9445 |
| Sierra Leone | Africa | 1992 | 38.333 | 4260884 | 1068.6963 |
| Sierra Leone | Africa | 1997 | 39.897 | 4578212 | 574.6482 |
| Somalia | Africa | 1992 | 39.658 | 6099799 | 926.9603 |
| Swaziland | Africa | 2007 | 39.613 | 1133066 | 4513.4806 |
| Zambia | Africa | 2002 | 39.193 | 10595811 | 1071.6139 |
| Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.0386 |
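If you want to convince yourself that the comma and & really do the same job, a quick check along these lines works. The tibble `toy` is made up for the purpose; only the comparison matters.

```r
library(dplyr)

# Made-up values for illustration
toy <- tibble(
  year    = c(1987, 1992, 1997, 2002),
  lifeExp = c(45.0, 38.3, 41.2, 39.2)
)

with_comma <- toy %>% filter(year > 1990, lifeExp < 40)
with_amp   <- toy %>% filter(year > 1990 & lifeExp < 40)

# Both keep the 1992 and 2002 rows
all.equal(with_comma, with_amp)
```

Which one to use is mostly a matter of taste; the comma tends to read better when each condition sits on its own line.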
filter() rows using multiple logical expressions where one must be TRUE
- filter() gm_df to only keep rows where pop < 10000 or gdpPercap > 100000
  - | means or
gm_df %>%
filter(pop < 10000 | gdpPercap > 100000) %>%
prettify(cols_changed = 5:6)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Kuwait | Asia | 1952 | 55.565 | 160000 | 108382.4 |
| Kuwait | Asia | 1957 | 58.033 | 212846 | 113523.1 |
| Kuwait | Asia | 1972 | 67.712 | 841934 | 109347.9 |
filter() rows using a string
- filter() gm_df to only keep rows where year is 1997 and continent is "Europe"
  - == means is equal to
gm_df %>%
filter(year == 1997 & continent == "Europe") %>%
prettify(cols_changed = 2:3)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Albania | Europe | 1997 | 72.950 | 3428038 | 3193.055 |
| Austria | Europe | 1997 | 77.510 | 8069876 | 29095.921 |
| Belgium | Europe | 1997 | 77.530 | 10199787 | 27561.197 |
| Bosnia and Herzegovina | Europe | 1997 | 73.244 | 3607000 | 4766.356 |
| Bulgaria | Europe | 1997 | 70.320 | 8066057 | 5970.389 |
| Croatia | Europe | 1997 | 73.680 | 4444595 | 9875.605 |
| Czech Republic | Europe | 1997 | 74.010 | 10300707 | 16048.514 |
| Denmark | Europe | 1997 | 76.110 | 5283663 | 29804.346 |
| Finland | Europe | 1997 | 77.130 | 5134406 | 23723.950 |
| France | Europe | 1997 | 78.640 | 58623428 | 25889.785 |
ggplot() Exercise 2
Steps
- From gm_df, select the continent, country, and gdpPercap columns
- filter() the rows to only keep those where continent == "Oceania"
- Pipe (%>%) the result to ggplot()
- Set the aes()thetic values
  - country for the x values
  - gdpPercap for the y values
- Add geom_boxplot() as the geometry of the plot
gm_df %>% # data frame: Data
select(continent, country, gdpPercap) %>% # columns to keep: Data
filter(continent == "Oceania") %>% # rows to keep: Data
ggplot(aes(x = country, y = gdpPercap)) + # x and y values: Aesthetics
geom_boxplot() # box plot: Geometries
mutate() Columns
sample_df %>%
select(country, pop) %>%
prettify()
| country | pop |
|---|---|
| Australia | 20434176 |
| Brazil | 190010647 |
| Hungary | 9956108 |
| Ireland | 4109086 |
| New Zealand | 4115771 |
| Nicaragua | 5675356 |
| Nigeria | 135031164 |
| Singapore | 4553009 |
| Sri Lanka | 20378239 |
| Tunisia | 10276158 |
sample_df %>%
select(country, pop) %>%
mutate(pop_in_thousands = pop / 1000) %>%
prettify(cols_changed = 3)
| country | pop | pop_in_thousands |
|---|---|---|
| Australia | 20434176 | 20434.176 |
| Brazil | 190010647 | 190010.647 |
| Hungary | 9956108 | 9956.108 |
| Ireland | 4109086 | 4109.086 |
| New Zealand | 4115771 | 4115.771 |
| Nicaragua | 5675356 | 5675.356 |
| Nigeria | 135031164 | 135031.164 |
| Singapore | 4553009 | 4553.009 |
| Sri Lanka | 20378239 | 20378.239 |
| Tunisia | 10276158 | 10276.158 |
Use mutate() to manipulate column values and create new columns.
In order to mutate() a column, use the name of the column you are manipulating and set its value using =.
Here’s a silly example:
- Start with gm_df
- mutate() gm_df to create a column named planet and set its value to "Earth"
gm_df %>%
mutate(planet = "Earth") %>%
prettify(cols_changed = 7)
| country | continent | year | lifeExp | pop | gdpPercap | planet |
|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | Earth |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 | Earth |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 | Earth |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 | Earth |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 | Earth |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | Earth |
| Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 | Earth |
| Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 | Earth |
| Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 | Earth |
| Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 | Earth |
Since we have gdpPercap and pop, we can calculate the values for a total_GDP column.
mutate() gm_df to set the results of a calculation on each row to a new column
- Calculate pop * gdpPercap and assign the result to total_GDP inside mutate()
gm_df %>%
mutate(total_GDP = pop * gdpPercap) %>%
prettify(cols_changed = 7)
| country | continent | year | lifeExp | pop | gdpPercap | total_GDP |
|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | 6567086330 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 | 7585448670 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 | 8758855797 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 | 9648014150 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 | 9678553274 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | 11697659231 |
| Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 | 12598563401 |
| Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 | 11820990309 |
| Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 | 10595901589 |
| Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 | 14121995875 |
Typically, mutate() is used to perform operations across columns in each individual row. You can also use summary functions to perform operations on individual columns (acting as vectors) that result in a vector that can be assigned to a column.
Makes sense, right??
Let’s calculate the z-score of each gdpPercap value for a specific year.
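\[ z = \frac{x_i - \mu_x}{\sigma_x} \]
- \(x_i\) = each value of x = gdpPercap
- \(\mu_x\) = the mean of x = mean(gdpPercap)
- \(\sigma_x\) = the standard deviation of x = sd(gdpPercap)
Steps
- Subtract mean(gdpPercap) from gdpPercap
- Divide the result by sd(gdpPercap)
- Assign the result to gdp_per_cap_z_score
gm_df %>%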
filter(year == 1977) %>%
mutate(gdp_per_cap_z_score = (gdpPercap - mean(gdpPercap)) / sd(gdpPercap)) %>%
prettify(cols_changed = 7)
| country | continent | year | lifeExp | pop | gdpPercap | gdp_per_cap_z_score |
|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | -0.7805156 |
| Albania | Europe | 1977 | 68.930 | 2509048 | 3533.0039 | -0.4520380 |
| Algeria | Africa | 1977 | 58.014 | 17152804 | 4910.4168 | -0.2873247 |
| Angola | Africa | 1977 | 39.483 | 6162675 | 3008.6474 | -0.5147414 |
| Argentina | Americas | 1977 | 68.481 | 26983828 | 10079.0267 | 0.3307461 |
| Australia | Oceania | 1977 | 73.490 | 14074100 | 18334.1975 | 1.3179128 |
| Austria | Europe | 1977 | 72.170 | 7568430 | 19749.4223 | 1.4871476 |
| Bahrain | Asia | 1977 | 65.593 | 297410 | 19340.1020 | 1.4382004 |
| Bangladesh | Asia | 1977 | 46.923 | 80428306 | 659.8772 | -0.7956111 |
| Belgium | Europe | 1977 | 72.800 | 9821800 | 19117.9745 | 1.4116381 |
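A quick sanity check on any z-score calculation is that the resulting column should have a mean of roughly 0 and a standard deviation of roughly 1. The sketch below runs the same formula on four made-up values (`toy` and `scored` are hypothetical names) and compares the result with base R’s scale(), which centers and scales in the same way.

```r
library(dplyr)

# Made-up GDP per capita values, for illustration only
toy <- tibble(gdpPercap = c(500, 1500, 2500, 7500))

scored <- toy %>%
  mutate(z = (gdpPercap - mean(gdpPercap)) / sd(gdpPercap))

z_mean <- mean(scored$z)  # approximately 0
z_sd   <- sd(scored$z)    # approximately 1

# scale() returns a one-column matrix, so flatten it before comparing
matches_scale <- all.equal(scored$z, as.numeric(scale(toy$gdpPercap)))
```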
Here are other functions that can be used similarly:
| Summary Functions | |
|---|---|
| first() | min() |
| last() | max() |
| nth() | mean() |
| n() | median() |
| n_distinct() | var() |
| IQR() | sd() |
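As a rough illustration of how these behave inside mutate(): each summary function boils a whole column down to a single value, and mutate() then repeats that value on every row. The tibble `toy` and the new column names below are made up for the example.

```r
library(dplyr)

# Made-up values; country "C" appears twice on purpose
toy <- tibble(
  country = c("A", "B", "C", "C", "D"),
  pop     = c(10, 30, 5, 7, 9)
)

with_summaries <- toy %>%
  mutate(
    n_rows        = n(),                # number of rows: 5
    n_countries   = n_distinct(country), # unique countries: 4
    max_pop       = max(pop),           # largest value: 30
    pop_vs_median = pop - median(pop)   # each row compared to the median (9)
  )
```

The last column shows the usual pattern: a per-row value combined with a whole-column summary, exactly as in the z-score example.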
ggplot() Exercise 3
Steps
- From gm_df, select() country, year, and gdpPercap
- filter() the rows to keep only those where country is "Korea, Rep.", "Korea, Dem. Rep.", "Japan", or "China"
- Pipe (%>%) the result to ggplot()
- Set the aes()thetic values
  - year for the x values
  - gdpPercap for the y values
  - country for the color values
- Add geom_line() as the geometry of the plot
- Add a title to the plot with labs()
gm_df %>%
select(country, year, gdpPercap) %>%
filter(country %in% c("Korea, Rep.", "Korea, Dem. Rep.", "Japan", "China")) %>%
ggplot(aes(x = year, y = gdpPercap, color = country)) +
geom_line() +
labs(title = "GDP Over Time")
arrange() Rows
sample_df %>%
select(country, gdpPercap) %>%
prettify()
| country | gdpPercap |
|---|---|
| Australia | 34435.367 |
| Brazil | 9065.801 |
| Hungary | 18008.944 |
| Ireland | 40675.996 |
| New Zealand | 25185.009 |
| Nicaragua | 2749.321 |
| Nigeria | 2013.977 |
| Singapore | 47143.180 |
| Sri Lanka | 3970.095 |
| Tunisia | 7092.923 |
sample_df %>%
select(country, gdpPercap) %>%
arrange(gdpPercap) %>%
prettify(cols_changed = 2)
| country | gdpPercap |
|---|---|
| Nigeria | 2013.977 |
| Nicaragua | 2749.321 |
| Sri Lanka | 3970.095 |
| Tunisia | 7092.923 |
| Brazil | 9065.801 |
| Hungary | 18008.944 |
| New Zealand | 25185.009 |
| Australia | 34435.367 |
| Ireland | 40675.996 |
| Singapore | 47143.180 |
Use arrange() to sort rows.
arrange() by ascending number (smallest to largest)
- arrange() gm_df’s pop column so that smallest populations are on top
gm_df %>%
arrange(pop) %>%
prettify(cols_changed = 5)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Sao Tome and Principe | Africa | 1952 | 46.471 | 60011 | 879.5836 |
| Sao Tome and Principe | Africa | 1957 | 48.945 | 61325 | 860.7369 |
| Djibouti | Africa | 1952 | 34.812 | 63149 | 2669.5295 |
| Sao Tome and Principe | Africa | 1962 | 51.893 | 65345 | 1071.5511 |
| Sao Tome and Principe | Africa | 1967 | 54.425 | 70787 | 1384.8406 |
| Djibouti | Africa | 1957 | 37.328 | 71851 | 2864.9691 |
| Sao Tome and Principe | Africa | 1972 | 56.480 | 76595 | 1532.9853 |
| Sao Tome and Principe | Africa | 1977 | 58.550 | 86796 | 1737.5617 |
| Djibouti | Africa | 1962 | 39.693 | 89898 | 3020.9893 |
| Sao Tome and Principe | Africa | 1982 | 60.351 | 98593 | 1890.2181 |
arrange() by desc() number (largest to smallest)
- arrange() the lifeExp column so that largest values are on top
gm_df %>%
arrange(desc(lifeExp)) %>%
prettify(cols_changed = 4)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Japan | Asia | 2007 | 82.603 | 127467972 | 31656.07 |
| Hong Kong, China | Asia | 2007 | 82.208 | 6980412 | 39724.98 |
| Japan | Asia | 2002 | 82.000 | 127065841 | 28604.59 |
| Iceland | Europe | 2007 | 81.757 | 301931 | 36180.79 |
| Switzerland | Europe | 2007 | 81.701 | 7554661 | 37506.42 |
| Hong Kong, China | Asia | 2002 | 81.495 | 6762476 | 30209.02 |
| Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.37 |
| Spain | Europe | 2007 | 80.941 | 40448191 | 28821.06 |
| Sweden | Europe | 2007 | 80.884 | 9031088 | 33859.75 |
| Israel | Asia | 2007 | 80.745 | 6426679 | 25523.28 |
arrange() alphabetically
- filter() gm_df to keep only those rows where year == 2007 and continent == "Americas"
- arrange() the country column alphabetically
gm_df %>%
filter(year == 2007, continent == "Americas") %>%
arrange(country) %>%
prettify(cols_changed = 2:3)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.380 |
| Bolivia | Americas | 2007 | 65.554 | 9119152 | 3822.137 |
| Brazil | Americas | 2007 | 72.390 | 190010647 | 9065.801 |
| Canada | Americas | 2007 | 80.653 | 33390141 | 36319.235 |
| Chile | Americas | 2007 | 78.553 | 16284741 | 13171.639 |
| Colombia | Americas | 2007 | 72.889 | 44227550 | 7006.580 |
| Costa Rica | Americas | 2007 | 78.782 | 4133884 | 9645.061 |
| Cuba | Americas | 2007 | 78.273 | 11416987 | 8948.103 |
| Dominican Republic | Americas | 2007 | 72.235 | 9319622 | 6025.375 |
| Ecuador | Americas | 2007 | 74.994 | 13755680 | 6873.262 |
group_by() for Grouped Data

sample_df %>%
select(country, continent, pop) %>%
prettify()
| country | continent | pop |
|---|---|---|
| Australia | Oceania | 20434176 |
| Brazil | Americas | 190010647 |
| Hungary | Europe | 9956108 |
| Ireland | Europe | 4109086 |
| New Zealand | Oceania | 4115771 |
| Nicaragua | Americas | 5675356 |
| Nigeria | Africa | 135031164 |
| Singapore | Asia | 4553009 |
| Sri Lanka | Asia | 20378239 |
| Tunisia | Africa | 10276158 |
sample_df %>%
select(country, continent, pop) %>%
group_by(continent) %>%
mutate(pop_by_continent = sum(pop)) %>%
ungroup() %>%
arrange(pop_by_continent) %>%
prettify(cols_changed = 4)
| country | continent | pop | pop_by_continent |
|---|---|---|---|
| Hungary | Europe | 9956108 | 14065194 |
| Ireland | Europe | 4109086 | 14065194 |
| Australia | Oceania | 20434176 | 24549947 |
| New Zealand | Oceania | 4115771 | 24549947 |
| Singapore | Asia | 4553009 | 24931248 |
| Sri Lanka | Asia | 20378239 | 24931248 |
| Nigeria | Africa | 135031164 | 145307322 |
| Tunisia | Africa | 10276158 | 145307322 |
| Brazil | Americas | 190010647 | 195686003 |
| Nicaragua | Americas | 5675356 | 195686003 |
group_by() allows us to group rows together based on column values.
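To make the effect of grouping concrete, here is a minimal sketch on a made-up table (toy_df is hypothetical and only stands in for sample_df). It assumes dplyr is installed, and compares the same mutate() call with and without group_by().

```r
library(dplyr)

# Hypothetical toy data standing in for sample_df
toy_df <- tibble(
  continent = c("Europe", "Europe", "Asia"),
  pop       = c(10, 4, 20)
)

# Without grouping, sum(pop) is computed over every row
ungrouped <- toy_df %>%
  mutate(total_pop = sum(pop))

# With group_by(), sum(pop) is computed separately within each continent
grouped <- toy_df %>%
  group_by(continent) %>%
  mutate(total_pop = sum(pop)) %>%
  ungroup()

ungrouped$total_pop
grouped$total_pop
```

The ungrouped version repeats one grand total on every row, while the grouped version repeats each continent's own total on that continent's rows.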
Let’s say we want to compute a summary value for each country across all years.
mean_gdp_per_cap of each country with group_by()
- Take gm_df and group_by() country to group rows of the same country together
- Use mean() to calculate the mean_gdp_per_cap
- ungroup() the rows
- Keep only the distinct() combinations of country and mean_gdp_per_cap
  - distinct()’s default is to only keep columns used as arguments

gm_df %>%
group_by(country) %>%
mutate(mean_gdp_per_cap = median(gdpPercap)) %>% # median() produced the values shown below; use mean() for a true mean
ungroup() %>%
distinct(country, mean_gdp_per_cap) %>%
prettify(cols_changed = 2)
| country | mean_gdp_per_cap |
|---|---|
| Afghanistan | 803.4832 |
| Albania | 3253.2384 |
| Algeria | 4853.8559 |
| Angola | 3264.6288 |
| Argentina | 9068.7844 |
| Australia | 18905.6034 |
| Austria | 20673.2530 |
| Bahrain | 18779.8016 |
| Bangladesh | 703.7638 |
| Belgium | 20048.9102 |
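The note about distinct()’s default can be illustrated with a minimal sketch on a made-up table (toy_df is hypothetical). It assumes dplyr is installed, and contrasts the default behavior with the `.keep_all = TRUE` argument, which keeps the remaining columns from the first row of each combination.

```r
library(dplyr)

# Hypothetical toy data, not part of gm_df
toy_df <- tibble(
  country = c("A", "A", "B"),
  year    = c(2002, 2007, 2007),
  pop     = c(1, 2, 3)
)

# Default: only the columns named in distinct() are kept
only_named <- toy_df %>%
  distinct(country)

# .keep_all = TRUE keeps the other columns too,
# taking them from the first row of each distinct combination
all_cols <- toy_df %>%
  distinct(country, .keep_all = TRUE)

names(only_named)
names(all_cols)
```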
ggplot() Exercise 4

Steps
- Using gm_df, group_by() the continent and year
- Use mutate() to add a column called mean_gdp for the average GDP per capita of each continent in each year
- ungroup() the data, because this is a habit that will save you headaches later
- Keep only the distinct() combinations of continent, year, and mean_gdp
- Pipe the result into ggplot()
- Set the aes()thetic values
  - year for the x values
  - mean_gdp for the y values
  - continent for the color values
- Use geom_line() as the geometry of the plot
- Add a title and a caption (for the source of the data) to the plot with labs()

gm_df %>%
group_by(year, continent) %>%
mutate(mean_gdp = mean(gdpPercap)) %>%
ungroup() %>%
distinct(continent, year, mean_gdp) %>%
ggplot(aes(x = year, y = mean_gdp, color = continent)) +
geom_line() +
labs(title = "Mean GDPs by Continent Over Time",
caption = "Source: Free material from www.gapminder.org")
summarize()

sample_df %>%
select(country, continent, lifeExp, pop) %>%
prettify()
| country | continent | lifeExp | pop |
|---|---|---|---|
| Australia | Oceania | 81.235 | 20434176 |
| Brazil | Americas | 72.390 | 190010647 |
| Hungary | Europe | 73.338 | 9956108 |
| Ireland | Europe | 78.885 | 4109086 |
| New Zealand | Oceania | 80.204 | 4115771 |
| Nicaragua | Americas | 72.899 | 5675356 |
| Nigeria | Africa | 46.859 | 135031164 |
| Singapore | Asia | 79.972 | 4553009 |
| Sri Lanka | Asia | 72.396 | 20378239 |
| Tunisia | Africa | 73.923 | 10276158 |
sample_df %>%
select(country, continent, lifeExp, pop) %>%
group_by(continent) %>%
summarise(max_pop = max(pop),
mean_life_exp = mean(lifeExp)) %>%
prettify(cols_changed = 2:3)
| continent | max_pop | mean_life_exp |
|---|---|---|
| Africa | 135031164 | 60.3910 |
| Americas | 190010647 | 72.6445 |
| Asia | 20378239 | 76.1840 |
| Europe | 9956108 | 76.1115 |
| Oceania | 20434176 | 80.7195 |
Now that we know how to use group_by(), we can summarize() data by group. This can be done using all of the summary functions seen earlier.
| Summary Functions | |
|---|---|
| first() | min() |
| last() | max() |
| nth() | mean() |
| n() | median() |
| n_distinct() | var() |
| IQR() | sd() |
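To show how summarize() relates to the mutate() plus distinct() pattern used earlier, here is a minimal sketch on a made-up table (toy_df is hypothetical). It assumes dplyr is installed. Both approaches give one row per group, but summarise() does it in a single step and can compute several summaries at once.

```r
library(dplyr)

# Hypothetical toy data, not part of gm_df
toy_df <- tibble(
  continent = c("Europe", "Europe", "Asia"),
  country   = c("A", "B", "C"),
  lifeExp   = c(73, 79, 72)
)

# The earlier pattern: add a grouped column, then keep distinct rows
via_mutate <- toy_df %>%
  group_by(continent) %>%
  mutate(mean_life_exp = mean(lifeExp)) %>%
  ungroup() %>%
  distinct(continent, mean_life_exp)

# summarise() collapses each group to a single row directly
via_summarise <- toy_df %>%
  group_by(continent) %>%
  summarise(mean_life_exp = mean(lifeExp),
            n_countries   = n_distinct(country))

via_mutate
via_summarise
```

One difference worth noticing: summarise() returns the groups in sorted order, while distinct() keeps them in order of first appearance.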
- Take gm_df and group_by() continent
- Using summarize() or summarise(), calculate:
  - count with n()
  - mean_pop with mean()
  - max_gdp_per_cap with max()

gm_df %>%
group_by(continent) %>%
summarise(count = n(),
mean_pop = mean(pop),
max_gdp_per_cap = max(gdpPercap)) %>%
prettify(cols_changed = 2:4)
| continent | count | mean_pop | max_gdp_per_cap |
|---|---|---|---|
| Africa | 624 | 9916003 | 21951.21 |
| Americas | 300 | 24504795 | 42951.65 |
| Asia | 396 | 77038722 | 113523.13 |
| Europe | 360 | 17169765 | 49357.19 |
| Oceania | 24 | 8874672 | 34435.37 |
ggplot() Exercise 5

Steps
- Using gm_df, filter() the data to keep only rows where continent is not "Oceania"
- group_by() continent and year
- summarize() the groups by calculating the mean() of pop
- ungroup() the data, because this is a habit that will save you headaches later
- Pipe the result into ggplot()
- Set the aes()thetics
  - year for the x values
  - mean_pop for the y values
  - continent for the color values
- Use geom_line() for the first geometry
- Use geom_point() for the second geometry
- Apply theme_minimal()
- Using facet_wrap(), split the plot into panels for each continent
  - ~ is used as a formula to select the facet variable
- Add a title and a caption with labs()

gm_df %>%
filter(continent != "Oceania") %>%
group_by(continent, year) %>%
summarise(mean_pop = mean(pop)) %>%
ungroup() %>%
ggplot(aes(x = year, y = mean_pop,
color = continent)) +
geom_line() +
geom_point() +
theme_minimal() +
facet_wrap(~ continent) +
labs(title = "Mean Continent Populations over Time",
caption = "Source: Free material from www.gapminder.org")
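On why the ungroup() habit matters after summarise(): when the data is grouped by more than one column, summarise() only drops the last grouping variable, so the result is still grouped. The minimal sketch below uses a made-up table (toy_df is hypothetical) and assumes dplyr 1.0.0 or later, where the `.groups` argument is available as an alternative to a separate ungroup() call.

```r
library(dplyr)

# Hypothetical toy data, not part of gm_df
toy_df <- tibble(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  year      = c(2002, 2007, 2002, 2007),
  pop       = c(10, 12, 5, 6)
)

# By default summarise() drops only the last grouping variable (year),
# so this result is still grouped by continent
still_grouped <- toy_df %>%
  group_by(continent, year) %>%
  summarise(mean_pop = mean(pop))

# .groups = "drop" returns an ungrouped result,
# equivalent to piping into ungroup() afterwards
dropped <- toy_df %>%
  group_by(continent, year) %>%
  summarise(mean_pop = mean(pop), .groups = "drop")

group_vars(still_grouped)
is_grouped_df(dropped)
```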